Hanqi Guo,
Jie Liu,
Xiaoru Yuan,
We developed a set of tools for this
task, built around two general components, QVisualizer and DestIPTree.
QVisualizer organizes and visualizes prox and
network traffic data according to their time stamps. The records of
one person on a particular day are shown as small color stripes in one row.
The rows can be organized to show one person’s activities over the
whole month in person-view mode, or everyone’s activities on one
specified day in day-view mode. Filtering operations are also implemented to
facilitate data exploration and analysis.
DestIPTree is designed to analyze the relationship
between network traffic and each computer’s behavior using the
Treemap technique. All destination IP addresses are organized into a 4-level
tree, one level per byte of the address. The size of each treemap cell is
determined by the visit count of the corresponding address, and its color by the
amount of data uploaded to that address. Finally, the network records of each
computer are overlaid on the IP treemap.
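The 4-level organization described above can be sketched as follows. This is a minimal illustration, not the DestIPTree implementation: the record format (`dest_ip`, `uploaded_bytes`) and the names `new_node` and `build_ip_tree` are assumptions made for the example. Each node accumulates the two quantities the treemap needs, the visit count (cell size) and the uploaded bytes (cell color).

```python
def new_node():
    # One tree node: aggregated visit count, aggregated upload size,
    # and children keyed by the next byte of the address.
    return {"visits": 0, "uploaded": 0, "children": {}}


def build_ip_tree(records):
    """Build a 4-level tree from (dest_ip, uploaded_bytes) tuples.

    The tuple layout is hypothetical; adapt it to the actual log format.
    """
    root = new_node()
    for dest_ip, uploaded in records:
        parts = dest_ip.split(".")
        if len(parts) != 4:
            continue  # skip malformed addresses
        node = root
        node["visits"] += 1
        node["uploaded"] += uploaded
        for part in parts:
            if part not in node["children"]:
                node["children"][part] = new_node()
            node = node["children"][part]
            node["visits"] += 1
            node["uploaded"] += uploaded
    return root
```

Any treemap layout algorithm could then consume the `visits` value of each node as the cell area and map `uploaded` to a color scale.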
QVisualizer and DestIPTree are linked together. We
use QVisualizer as the major tool to discover abnormal behaviors of employees and
computers. DestIPTree is utilized to find out network traffic patterns.
Video:
ANSWERS:
MC1.1: Identify which computer(s) the employee most likely used to send
information to his contact in a tab-delimited table which contains for each
computer identified: when the information was sent, how much information was
sent and where that information was sent.
MC1.2: Characterize the patterns of behavior of suspicious computer use.
Summary statistics over all data records show
115414 network access operations and 20243 distinct destination IP addresses,
19221 of which were visited only once. The destination port
helps to filter addresses, since port 80 is usually used by web servers and
port 25 by mail servers. Port 8080 is more ambiguous; one of its common uses is for proxies.
We sorted the dataset by different fields and found
that all visited-once IP addresses use port 80. The uploaded data size ranges
from 100 to 13687307, and the downloaded data size ranges from 2045 to
10000000.
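Counts of this kind can be reproduced with a few lines of code. The sketch below assumes each log record is a dict with the hypothetical keys `dest_ip` and `port`; the function name `summarize` is likewise only illustrative.

```python
from collections import Counter


def summarize(records):
    """Summarize access records given as dicts with 'dest_ip' and 'port'.

    The key names are assumptions, not the field names of the real log.
    """
    visits = Counter(r["dest_ip"] for r in records)
    # Addresses that appear in exactly one record.
    once = {ip for ip, n in visits.items() if n == 1}
    return {
        "accesses": len(records),
        "distinct_ips": len(visits),
        "visited_once": len(once),
        # Ports used by the visited-once addresses.
        "ports_of_visited_once": {r["port"] for r in records if r["dest_ip"] in once},
    }
```

If the set under `ports_of_visited_once` contains a single port, every visited-once address uses that port.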
The raw data come in two types: the
prox data and the IP log data. We implemented a tool with two
windows to display and compare them, placing the prox events and the network
accesses on a timeline in each window. According to the background information,
employees are supposed to use their own machines, so we assume that each
employee ID corresponds to one IP address. The
comparison in our tool further supports this assumption, so
we can combine the two data sources. The two windows can also be
merged into one, which is the basic concept of QVisualizer.
Furthermore, we attempted to discover patterns
in the network accesses. Facing a large number of IP addresses, we initially
viewed the IP addresses and user IDs in a matrix. The IP addresses were sorted
by date and time, and a level-of-detail (LOD) scheme was introduced to aid observation.
However, this view did not yield much useful information. To avoid its
disadvantages, we adopted the Treemap technique. The user IDs are represented
as spots overlaid on the Treemap, so that their
relationships to the destination addresses are easy to see (Figure 1).
Figure 1 Overview of our program
We observed the data features with our tools and soon
found some abnormal events that violate our “one user, one
machine” rule. This makes it easy to design an automatic
exception detection scheme based on the rule. If an employee is in the
classified room, his computer should, in theory, not upload any data through
the internet. Any upload record in such a period is therefore treated as an
exception. QVisualizer reports 8 records of this kind (Figure 2).
Figure 2 Auto-detection of abnormal events
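The detection rule can be sketched as an interval check. This is a minimal illustration under stated assumptions, not the code inside QVisualizer: the prox event kinds (`enter_classified`, `leave_classified`), the tuple layouts, and the `owner_of` mapping from source IP to employee ID are all hypothetical.

```python
from datetime import datetime


def classified_intervals(prox_events):
    """Turn time-sorted (timestamp, employee_id, kind) events into
    per-employee lists of (enter, leave) intervals.

    The event kinds are assumed names, not those of the real prox log.
    """
    open_since = {}
    intervals = {}
    for ts, emp, kind in prox_events:
        if kind == "enter_classified":
            open_since[emp] = ts
        elif kind == "leave_classified" and emp in open_since:
            intervals.setdefault(emp, []).append((open_since.pop(emp), ts))
    return intervals


def find_exceptions(accesses, intervals, owner_of):
    """Flag accesses sent from a machine while its owner is in the
    classified room.

    accesses: (timestamp, source_ip, dest_ip, uploaded_bytes) tuples.
    owner_of: dict mapping source_ip to employee_id (one user, one machine).
    """
    flagged = []
    for ts, src, dst, size in accesses:
        emp = owner_of.get(src)
        if emp is None:
            continue  # no known owner for this machine
        if any(start <= ts <= end for start, end in intervals.get(emp, [])):
            flagged.append((ts, emp, dst, size))
    return flagged
```

Entries without a matching exit are ignored here; a real log may need extra handling for such unpaired events.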
We also checked the data manually in QVisualizer, whose
flexible display modes and interactions allow us to view all the data quickly. For
instance, employee #13 usually comes to work at around 10:00 am, but on Jan.
22nd his computer uploaded a large amount of data at 8:50 am, although the log
system indicates that he came to the office as usual, at 10:00 am. We compared this case
with the previous 8 records by printing all their information and found common
features: all these suspicious network accesses point to the
same IP address (100.59.151.133), and their request-response ratios are also quite
large. This is probably the pattern we are looking for (Figure 3).
Figure 3 Detection of the abnormal activity of employee #13
We used the brush tool of QVisualizer to trace all
network access records with the same destination IP and found 9 other
records. They do not show obviously exceptional activity, but they still
share some patterns. All of them uploaded a large amount of data, and each
appears as a single isolated symbol on QVisualizer’s timeline: the other network
accesses happened long before or after it. For example, employee #18 uploaded
12398 bytes of data at 15:15 on Jan. 17th, and more than two hours later, at
17:57, accessed the suspicious IP address and uploaded 5873546 bytes of
data. What’s more, there is no other activity record for employee #18 after
17:20 on any of the 30 days
(Figure 4).
Figure 4 Trace other abnormal activities
The evidence above characterizes the pattern
of suspicious computer use. All abnormal network accesses go to
the same IP address and upload a large amount of data. Their timing is also unusual:
they usually happen on computers whose users are probably in the classified
area at that time, or they are the only network activity during a long
period of time.
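The second timing pattern, an access that stands alone on a machine's timeline, can be tested with a simple gap check. The sketch below is only an illustration: the tuple layout, the function name `isolated_accesses`, and the one-hour threshold `min_gap` are assumptions, since the text only says the surrounding activity happened "long before or after".

```python
from datetime import datetime, timedelta


def isolated_accesses(accesses, suspicious_ip, min_gap=timedelta(hours=1)):
    """Return accesses to suspicious_ip whose nearest other access from
    the same source machine is at least min_gap away.

    accesses: (timestamp, source_ip, dest_ip, uploaded_bytes) tuples.
    min_gap is an assumed threshold, not a value from the analysis.
    """
    by_source = {}
    for rec in accesses:
        by_source.setdefault(rec[1], []).append(rec)

    result = []
    for recs in by_source.values():
        recs.sort(key=lambda r: r[0])  # order each machine's records by time
        for i, rec in enumerate(recs):
            if rec[2] != suspicious_ip:
                continue
            gaps = []
            if i > 0:
                gaps.append(rec[0] - recs[i - 1][0])
            if i + 1 < len(recs):
                gaps.append(recs[i + 1][0] - rec[0])
            # A record with no neighbours at all is trivially isolated.
            if not gaps or min(gaps) >= min_gap:
                result.append(rec)
    return result
```

Combining this check with the classified-room rule would cover both timing patterns described above.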
Finally, we used this evidence and the data
records to find the suspected employee. As we concluded, many
computers leaked data, but there is only one suspected employee. This person
is therefore very likely to use other people’s computers to send data when they are
not in the office. We checked, for every employee, whether he was in the
secure area while his machine was leaking data and, if so, excluded him from our
suspect list. We finally arrived at employee #27, who matches the characteristics
listed above well.